Counting Word Frequencies with Python

The debates that occur in the Canadian Parliament are transcribed and published in a document known as Hansard. Since 2006, Hansard has been available for public download in .xml file format. To date, this archive contains over 57 million words. By converting these transcripts from .xml to .txt, the files can be processed in numerous ways with the Python programming language.

The purpose of this document is to illustrate how to count specific words in the collection of text files that will be referred to from now on as the corpus. Terms that may be unfamiliar to the reader are highlighted in bold, and contain a definition that can be accessed by hovering the mouse pointer over the word. While it is not necessary to read every line of code to understand the process, explanatory comments within the code are marked with a # and coloured light blue.

Part 1: Determining the Word Frequency for a Single File

Before we can begin working with a piece of text, it must be loaded into and read by Python. Rather than altering (and potentially irreversibly changing) our original text file, we will work with the contents of the file as a string of text contained within a variable. Now we can close our original text file, keeping it intact.

In the next piece of code we will open the file that contains the textual transcripts of all of the Parliamentary debates that occurred in the House of Commons during the 39th sitting of Parliament.

While working with one long string may be useful for other applications, our purposes require that we split the text into pieces. In this case, each word will become its own string, as the short example below illustrates.
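To see what this looks like on a small scale, here is a toy sentence (made up for illustration, not taken from the corpus) split on whitespace:

sample = "The Speaker opened the sitting of the House."
sample.split()
# ['The', 'Speaker', 'opened', 'the', 'sitting', 'of', 'the', 'House.']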


In [1]:
# 1. open the text file
infile = open('data/39.txt')
# 2. read the file and assign it to the variable 'text'
text = infile.read()
# 3. close the text file
infile.close()
# 4. split the variable 'text' into distinct word strings
words = text.split()

Now that we have loaded our file, we can begin to work on it. Python offers many pre-built tools to make the task of coding easier. Some of the most commonly used tools are known as functions. Functions are useful for automating tasks that would otherwise require repetitive coding. While Python has many built-in functions, the language's true power comes from the ability to define new functions based on programming needs. In the code above, we've already used four of these tools: open, read, close, and split.

In the next piece of code we will define our own function, called count_in_list. This function will allow us to count the occurrences of any word in the corpus.


In [2]:
# 5. define the 'count_in_list' function
def count_in_list(item_to_count, list_to_search):
    "Count the number of times a specified word occurs in a list of words"
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits

Now we can call the function for any word we choose. The next example shows that there are 392 occurrences of the word privacy contained in the transcripts for the 39th sitting of Parliament.


In [3]:
# 6. here the function counts the instances of the word 'privacy'
print "Instances of the word \'privacy\':", (count_in_list("privacy", words))


Instances of the word 'privacy': 392

Unfortunately, there are two distinct problems here, both centred on the fact that our function only counts the string privacy exactly as it appears.

The first problem is that text strings are case-sensitive. If the word contains UPPERCASE and lowercase letters, the word that is searched for will only be counted if the cases match exactly. The following example counts the number of instances of Privacy with the first letter capitalized.


In [4]:
print "Instances of the word \'Privacy\':", (count_in_list("Privacy", words))


Instances of the word 'Privacy': 454

Here is a more extreme example to illustrate the point.


In [5]:
print "Instances of the word \'pRiVaCy\':",(count_in_list("pRiVaCy", words))


Instances of the word 'pRiVaCy': 0

The second problem is punctuation. Just as words are case-sensitive, they are also punctuation-sensitive. If a piece of punctuation is attached to a word, it becomes part of the string that is searched. Here we count the occurrences of privacy with a comma attached to the end of the word.


In [6]:
print "Instances of the word \'privacy,\':", (count_in_list("privacy,", words))


Instances of the word 'privacy,': 41

And here we count privacy., with the word followed by a period.


In [7]:
print "Instances of the word \'privacy.\':",(count_in_list("privacy.", words))


Instances of the word 'privacy.': 65

We could comb through the text to find every variant of privacy, run the code for each one, and add all of the numbers together, but that would be time-consuming and potentially inaccurate. Instead, we must process the text further to make it uniform. In this case we want to make all of the characters lowercase and keep only tokens made up entirely of letters, which removes the punctuation problem.

Python has built-in string methods, lower() and isalpha(), that will do this for us.
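As a quick illustration with toy strings (not drawn from the corpus), isalpha() reports whether a string consists only of letters, and lower() converts a string to lowercase:

"Privacy,".isalpha()
# False
"Privacy".isalpha()
# True
"Privacy".lower()
# 'privacy'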

We will reload the text file, split the text into distinct words or tokens, and then clean the tokens with these methods.


In [8]:
infile = open('data/39.txt')
text = infile.read()
infile.close()
tokens = text.split()

# keep only tokens made up entirely of letters and convert them to lowercase
words = [w.lower() for w in tokens if w.isalpha()]

Now, when we count the instances of privacy, we are presented with a total of 846 instances.


In [9]:
print "Instances of the word \'privacy\':", (count_in_list("privacy", words))


Instances of the word 'privacy': 846

Part 2: Determining Word Frequencies for the Entire Corpus

Now let's see how this compares to the rest of the corpus. To accomplish this, we must write another function that will read all of the text files in our file folder.

First we need to introduce a feature of Python that we've yet to see: modules. Modules are packages of functions and code that serve specific purposes, much like functions but on a larger scale.

The next piece of code imports a module called os, specifically the function listdir. We will use listdir to print a list of all the files in a specific directory. Each of the listed .txt files corresponds to a textual transcript of Hansard (the .DS_Store entry is a hidden system file that we will filter out shortly). The first ten .txt files contain the complete transcript for each year from 2006 to 2015, while the last three are the transcripts corresponding to each sitting of Parliament, in this case the 39th through to the end of the second sitting of the 41st Parliament.


In [10]:
# imports the os module
from os import listdir
# calls the listdir function to list the files in a specific directory
listdir("data")


Out[10]:
['.DS_Store',
 '2006.txt',
 '2007.txt',
 '2008.txt',
 '2009.txt',
 '2010.txt',
 '2011.txt',
 '2012.txt',
 '2013.txt',
 '2014.txt',
 '2015.txt',
 '39.txt',
 '40.txt',
 '41.txt']

Although we can display the contents of a directory by using the listdir function, we want to store the filenames in a list of our own so that we can include only files with the extension .txt and keep the path to each file. Here we create another function called list_textfiles.


In [11]:
def list_textfiles(directory):
    "Return a list of filenames ending in '.txt'"
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles

Rather than writing code to open each file individually, we can create another custom function to open the file we pass to it. We'll call this one read_file.


In [12]:
def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    infile = open(filename)
    contents = infile.read()
    infile.close()
    return contents

Now we can open all of the files in our directory, split each text into tokens, keep only the lowercase alphabetic words, and store each file's word list in our variable corpus.


In [13]:
corpus = []
for filename in list_textfiles("data"):
    # reads the file
    text = read_file(filename)
    # splits the text into tokens
    tokens = text.split()
    # keeps only alphabetic tokens and converts them to lowercase
    words = [w.lower() for w in tokens if w.isalpha()]
    # adds this file's word list to the corpus
    corpus.append(words)

Let's check to make sure the code worked by using the len function to count the number of items in our corpus list.


In [14]:
print"There are", len(corpus), "files in the list, named: ", ', '.join(list_textfiles('data'))


There are 13 files in the list, named:  data/2006.txt, data/2007.txt, data/2008.txt, data/2009.txt, data/2010.txt, data/2011.txt, data/2012.txt, data/2013.txt, data/2014.txt, data/2015.txt, data/39.txt, data/40.txt, data/41.txt

Let's create a function to make the names of the files more readable. First we'll have to strip the file extension .txt.


In [15]:
from os.path import splitext

def remove_ext(filename):
    "Removes the file extension, such as .txt"
    name, extension = splitext(filename)
    return name

# call remove_ext on each file path (we are not storing the results yet)
for files in list_textfiles('data'):
    remove_ext(files)

Now let's make a function to remove the directory prefix, data/.


In [16]:
from os.path import basename

def remove_dir(filepath):
    "Removes the path from the file name"
    name = basename(filepath)
    return name
# call remove_dir on each file path (again, without storing the results)
for files in list_textfiles('data'):
    remove_dir(files)

And finally, we'll write a function to tie the two functions together.


In [17]:
def get_filename(filepath):
    "Removes the path and file extension from the file name"
    filename = remove_ext(filepath)
    name = remove_dir(filename)
    return name


filenames = []
for files in list_textfiles('data'):
    files = get_filename(files)
    filenames.append(files)
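
For example, calling get_filename on one of the paths returned by list_textfiles removes both the directory and the extension:

get_filename("data/2006.txt")
# '2006'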

Now we can display a readable list of the files within our directory.


In [18]:
print"There are", len(corpus), "files in the list, named:", ', '.join(filenames),"."


There are 13 files in the list, named: 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 39, 40, 41 .

The next step involves iterating through both lists, corpus and filenames, in order to generate a word frequency for each file in the corpus. For this we will use Python's zip function.
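If zip is unfamiliar, here is a toy illustration (with made-up lists, not our data) of how it pairs up the items of two lists; note that in Python 2, the version used by the code in this document, zip returns the pairs as a list:

zip(["a", "b", "c"], [1, 2, 3])
# [('a', 1), ('b', 2), ('c', 3)]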


In [19]:
for words, names in zip(corpus, filenames):
    print"Instances of the word \'privacy\' in",names, ":", count_in_list("privacy", words)


Instances of the word 'privacy' in 2006 : 356
Instances of the word 'privacy' in 2007 : 258
Instances of the word 'privacy' in 2008 : 252
Instances of the word 'privacy' in 2009 : 612
Instances of the word 'privacy' in 2010 : 533
Instances of the word 'privacy' in 2011 : 624
Instances of the word 'privacy' in 2012 : 552
Instances of the word 'privacy' in 2013 : 918
Instances of the word 'privacy' in 2014 : 1567
Instances of the word 'privacy' in 2015 : 806
Instances of the word 'privacy' in 39 : 846
Instances of the word 'privacy' in 40 : 1643
Instances of the word 'privacy' in 41 : 3989

What's exciting about this code is that we can now search the entire corpus for any word we choose. Let's search for information.


In [20]:
for words, names in zip(corpus, filenames):
    print"Instances of the word \'information\' in",names, ":", count_in_list("information", words)


Instances of the word 'information' in 2006 : 2724
Instances of the word 'information' in 2007 : 2328
Instances of the word 'information' in 2008 : 1610
Instances of the word 'information' in 2009 : 3554
Instances of the word 'information' in 2010 : 3913
Instances of the word 'information' in 2011 : 3250
Instances of the word 'information' in 2012 : 4137
Instances of the word 'information' in 2013 : 3281
Instances of the word 'information' in 2014 : 4999
Instances of the word 'information' in 2015 : 2728
Instances of the word 'information' in 39 : 6615
Instances of the word 'information' in 40 : 9688
Instances of the word 'information' in 41 : 16221

How about ethics?


In [21]:
for words, names in zip(corpus, filenames):
    print"Instances of the word \'ethics\' in",names, ":", count_in_list("ethics", words)


Instances of the word 'ethics' in 2006 : 456
Instances of the word 'ethics' in 2007 : 177
Instances of the word 'ethics' in 2008 : 680
Instances of the word 'ethics' in 2009 : 333
Instances of the word 'ethics' in 2010 : 473
Instances of the word 'ethics' in 2011 : 211
Instances of the word 'ethics' in 2012 : 285
Instances of the word 'ethics' in 2013 : 615
Instances of the word 'ethics' in 2014 : 339
Instances of the word 'ethics' in 2015 : 156
Instances of the word 'ethics' in 39 : 1304
Instances of the word 'ethics' in 40 : 911
Instances of the word 'ethics' in 41 : 1510

While word frequencies, by themselves, do not give us a tremendous amount of contextual information, they are a valuable first step in conducting large-scale text analyses. For instance, returning to our frequency list for privacy, we can observe a general trend suggesting that the use of privacy has been increasing between 2006 and now. It is important to note that our calculations are raw counts. For a more contextual analysis we could account for how often Parliament was in session during each period, or perhaps compare the count of privacy to the total number of words in each file.
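As a minimal sketch of what that last idea might look like (a preview only, reusing the corpus, filenames, and count_in_list defined above), we could divide each raw count by the total number of words in the corresponding file:

for words, names in zip(corpus, filenames):
    # raw count of 'privacy' divided by the total number of tokens in the file
    relative = count_in_list("privacy", words) / float(len(words))
    print "Relative frequency of \'privacy\' in", names, ":", relative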

Stay tuned for the next section: Adding Context to Word Frequency Counts